Enhancing Password Security Through a High-Accuracy Scoring Framework Using Random Forests

Mazelan, Muhammed El Mustaqeem, Abdul, Noor Hazlina, AlDahoul, Nouar

arXiv.org Artificial Intelligence

Password security plays a crucial role in cybersecurity, yet traditional password strength meters, which rely on static rules like character-type requirements, often fail. Such methods are easily bypassed by common password patterns (e.g., 'P@ssw0rd1!'), giving users a false sense of security. To address this, we implement and evaluate a password strength scoring system by comparing four machine learning models: Random Forest (RF), Support Vector Machine (SVM), a Convolutional Neural Network (CNN), and Logistic Regression, trained on a dataset of over 660,000 real-world passwords. Our primary contribution is a novel hybrid feature engineering approach that captures nuanced vulnerabilities missed by standard metrics. We introduce features like leetspeak-normalized Shannon entropy to assess true randomness, pattern detection for keyboard walks and sequences, and character-level TF-IDF n-grams to identify frequently reused substrings from breached password datasets. Crucially, the interpretability of the Random Forest model allows for feature importance analysis, providing a clear pathway to developing security tools that offer specific, actionable feedback to users. This study bridges the gap between predictive accuracy and practical usability, resulting in a high-performance scoring system that not only reduces password-based vulnerabilities but also empowers users to make more informed security decisions.

Keywords: Password Security, Machine Learning, Rule-Based Attack, Brute-Force Attack, Dictionary Attack, Cybersecurity.

1. Passwords remain a cornerstone of online security, serving as the primary means of authentication for countless systems and applications. However, this reliance is a critical vulnerability; according to a report by Google Cloud, a staggering 86% of breaches involve stolen credentials, posing a significant threat to both user data and system security [1]. Many users choose weak, easily guessable passwords, and attackers frequently exploit this in large-scale attacks, compromising user privacy and enabling financial fraud. Most traditional password strength scoring tools rely on static rules, such as requiring a mix of lowercase, uppercase, digits, and special characters (LUDS), which fail to adapt to evolving attack patterns.
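The entropy and sequence features the abstract describes can be sketched in a few lines of Python. The leetspeak table and run length here are illustrative assumptions, not the authors' exact definitions:

```python
import math
from collections import Counter

# Illustrative leetspeak substitutions (an assumption, not the paper's table).
LEET_MAP = str.maketrans("@0134$!5", "aoleasis")

def normalize_leet(password: str) -> str:
    """Undo common leetspeak substitutions so 'P@ssw0rd' scores like 'password'."""
    return password.lower().translate(LEET_MAP)

def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character of the given string."""
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

def has_ascending_run(password: str, run: int = 3) -> bool:
    """Detect ascending sequences such as 'abc' or '123' of length >= run."""
    s = password.lower()
    return any(
        all(ord(s[i + j + 1]) - ord(s[i + j]) == 1 for j in range(run - 1))
        for i in range(len(s) - run + 1)
    )
```

Passing 'P@ssw0rd1!' through `normalize_leet` first exposes it as a dictionary word plus a short sequence, even though a static LUDS rule would rate it as strong.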


Solution Space Topology Guides CMTS Search

Mannucci, Mirco A.

arXiv.org Artificial Intelligence

A fundamental question in search-guided AI: what topology should guide Monte Carlo Tree Search (MCTS) in puzzle solving? Prior work applied topological features to guide MCTS in ARC-style tasks using grid topology -- the Laplacian spectral properties of cell connectivity -- and found no benefit. We identify the root cause: grid topology is constant across all instances. We propose measuring \emph{solution space topology} instead: the structure of valid color assignments constrained by detected pattern rules. We build this via compatibility graphs where nodes are $(cell, color)$ pairs and edges represent compatible assignments under pattern constraints. Our method: (1) detect pattern rules automatically with 100\% accuracy on 5 types, (2) construct compatibility graphs encoding solution space structure, (3) extract topological features (algebraic connectivity, rigidity, color structure) that vary with task difficulty, (4) integrate these features into MCTS node selection via sibling-normalized scores. We provide formal definitions, a rigorous selection formula, and comprehensive ablations showing that algebraic connectivity is the dominant signal. The work demonstrates that topology matters for search -- but only the \emph{right} topology. For puzzle solving, this is solution space structure, not problem space structure.
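As a toy illustration of the compatibility-graph construction (the cell adjacency, pattern rule, and sizes here are minimal assumptions, not the paper's ARC-style tasks), the algebraic connectivity is simply the second-smallest eigenvalue of the graph Laplacian:

```python
from itertools import product
import numpy as np

# Toy task: 3 cells in a path (0-1-2), 2 colors, and one assumed pattern
# rule: adjacent cells must not share a color.
cells, colors = [0, 1, 2], ["red", "blue"]
adjacent = {(0, 1), (1, 2)}

# Nodes are (cell, color) pairs; edges join compatible assignments.
nodes = list(product(cells, colors))
index = {node: i for i, node in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for (c1, k1), (c2, k2) in product(nodes, repeat=2):
    if c1 == c2:
        continue  # one color per cell: same-cell nodes are never linked
    if (min(c1, c2), max(c1, c2)) in adjacent and k1 == k2:
        continue  # violates the assumed pattern rule
    A[index[c1, k1], index[c2, k2]] = 1

L = np.diag(A.sum(axis=1)) - A  # graph Laplacian
algebraic_connectivity = np.sort(np.linalg.eigvalsh(L))[1]
```

A larger value indicates a more tightly connected solution space; in the paper this signal (among others) is fed into MCTS node selection via sibling-normalized scores.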


LLM-Based Design Pattern Detection

Schindler, Christian, Rausch, Andreas

arXiv.org Artificial Intelligence

Detecting design pattern instances in unfamiliar codebases remains a challenging yet essential task for improving software quality and maintainability. Traditional static analysis tools often struggle with the complexity, variability, and lack of explicit annotations that characterize real-world pattern implementations. In this paper, we present a novel approach leveraging Large Language Models to automatically identify design pattern instances across diverse codebases. Our method focuses on recognizing the roles classes play within the pattern instances. By providing clearer insights into software structure and intent, this research aims to support developers, improve comprehension, and streamline tasks such as refactoring, maintenance, and adherence to best practices. Identifying design pattern instances in code is a valuable goal, as it enables a deeper understanding of the structural and behavioral principles underlying software systems. By uncovering these patterns, developers and other stakeholders can gain insights into code quality, maintainability, and adherence to best practices, even in unfamiliar codebases. Automating this process can significantly reduce the time and effort required for code comprehension, facilitate knowledge transfer among teams, and improve software evolution and refactoring efforts.


Sparser, Better, Faster, Stronger: Efficient Automatic Differentiation for Sparse Jacobians and Hessians

Hill, Adrian, Dalle, Guillaume

arXiv.org Artificial Intelligence

From implicit differentiation to probabilistic modeling, Jacobians and Hessians have many potential use cases in Machine Learning (ML), but conventional wisdom views them as computationally prohibitive. Fortunately, these matrices often exhibit sparsity, which can be leveraged to significantly speed up the process of Automatic Differentiation (AD). This paper presents advances in Automatic Sparse Differentiation (ASD), starting with a new perspective on sparsity detection. Our refreshed exposition is based on operator overloading, able to detect both local and global sparsity patterns, and naturally avoids dead ends in the control flow graph. We also describe a novel ASD pipeline in Julia, consisting of independent software packages for sparsity detection, matrix coloring, and differentiation, which together enable ASD based on arbitrary AD backends. Our pipeline is fully automatic and requires no modification of existing code, making it compatible with existing ML codebases. We demonstrate that this pipeline unlocks Jacobian and Hessian matrices at scales where they were considered too expensive to compute. On real-world problems from scientific ML and optimization, we show significant speed-ups of up to three orders of magnitude. Notably, our ASD pipeline often outperforms standard AD for one-off computations, once thought impractical due to slower sparsity detection methods.
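The core trick, compressing structurally orthogonal Jacobian columns into shared function evaluations, can be sketched in NumPy. This is a hand-rolled illustration with a toy function, a known sparsity pattern, and a greedy coloring, not the authors' Julia pipeline:

```python
import numpy as np

def f(x):
    # Each output i depends only on x[i] and x[i-1] (cyclically): sparse Jacobian.
    return x**2 + 0.5 * np.roll(x, 1)

n = 6
pattern = np.zeros((n, n), dtype=bool)  # known sparsity pattern of J
for j in range(n):
    pattern[j, j] = pattern[(j + 1) % n, j] = True

# Greedy column coloring: columns in one group touch disjoint sets of rows.
rows_of = [set(np.flatnonzero(pattern[:, j])) for j in range(n)]
groups = []
for j in range(n):
    for g in groups:
        if all(rows_of[j].isdisjoint(rows_of[k]) for k in g):
            g.append(j)
            break
    else:
        groups.append([j])

# One finite-difference evaluation per group instead of one per column.
x0, h = np.linspace(1.0, 2.0, n), 1e-6
J = np.zeros((n, n))
base = f(x0)
for g in groups:
    seed = np.zeros(n)
    seed[g] = h
    diff = (f(x0 + seed) - base) / h
    for j in g:
        rows = list(rows_of[j])
        J[rows, j] = diff[rows]
```

Here two perturbed evaluations replace six; for large banded Jacobians the cost scales with the number of colors, not the dimension, which is the effect the paper's speed-ups exploit.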


How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension

Dai, Xinnan, Qu, Haohao, Shen, Yifen, Zhang, Bohang, Wen, Qihao, Fan, Wenqi, Li, Dongsheng, Tang, Jiliang, Shan, Caihua

arXiv.org Artificial Intelligence

Benchmarking the capabilities and limitations of large language models (LLMs) in graph-related tasks is becoming an increasingly popular and crucial area of research. Recent studies have shown that LLMs exhibit a preliminary ability to understand graph structures and node features. However, the potential of LLMs in graph pattern mining remains largely unexplored, even though it is a key component in fields such as computational chemistry, biology, and social network analysis. To bridge this gap, this work introduces a comprehensive benchmark to assess LLMs' capabilities in graph pattern tasks. We have developed a benchmark that evaluates whether LLMs can understand graph patterns based on either terminological or topological descriptions. Additionally, our benchmark tests the LLMs' capacity to autonomously discover graph patterns from data. The benchmark encompasses both synthetic and real datasets, and a variety of models, with a total of 11 tasks and 7 models. Our experimental framework is designed for easy expansion to accommodate new models and datasets. Our findings reveal that: (1) LLMs have preliminary abilities to understand graph patterns, with O1-mini outperforming the other models in the majority of tasks; (2) formatting input data to align with the knowledge acquired during pretraining can enhance performance; (3) the strategies employed by LLMs may differ from those used in conventional algorithms. Originally trained on textual data, LLMs have demonstrated remarkable success in various tasks, such as reading comprehension and text reasoning (Achiam et al., 2023; Touvron et al., 2023). To evaluate whether LLMs can transfer this text-understanding ability to graphs, several studies have investigated this at both the feature and structural levels (Zhao et al., 2023; Chai et al., 2023). Specifically, LLMs have been shown to enhance node features in social networks (Ye et al., 2023; Huang et al., 2024). Additionally, graph structure understanding tasks, such as shortest path and connectivity, have also been evaluated (Guo et al., 2023; Wang et al., 2024). Graph patterns, a key aspect of graphs, have yet to be thoroughly explored.
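For intuition on the comparison the benchmark draws, here is how a conventional algorithm answers one such topological question, detecting a triangle in an edge list (a generic illustration, not the benchmark's code):

```python
def contains_triangle(edges):
    """True if the undirected graph given as an edge list has a 3-cycle."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    # A triangle exists iff some edge's endpoints share a common neighbor.
    return any(adjacency[u] & adjacency[v] for u, v in edges)
```

The benchmark's terminological-versus-topological framing then asks whether an LLM reaches the same answer from a textual description of the same edges, and by what strategy.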


Reducing False Discoveries in Statistically-Significant Regional-Colocation Mining: A Summary of Results

Ghosh, Subhankar, Gupta, Jayant, Sharma, Arun, An, Shuai, Shekhar, Shashi

arXiv.org Artificial Intelligence

Given a set \emph{S} of spatial feature types, its feature instances, a study area, and a neighbor relationship, the goal is to find pairs $<$a region ($r_{g}$), a subset \emph{C} of \emph{S}$>$ such that \emph{C} is a statistically significant regional-colocation pattern in $r_{g}$. This problem is important for applications in various domains including ecology, economics, and sociology. The problem is computationally challenging due to the exponential number of regional colocation patterns and candidate regions. Previously, we proposed a miner \cite{10.1145/3557989.3566158} that finds statistically significant regional colocation patterns. However, the numerous simultaneous statistical inferences raise the risk of false discoveries (also known as the multiple comparisons problem) and carry a high computational cost. We propose a novel algorithm, namely, multiple comparisons regional colocation miner (MultComp-RCM) which uses a Bonferroni correction. Theoretical analysis, experimental evaluation, and case study results show that the proposed method reduces both the false discovery rate and computational cost.
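The Bonferroni correction at the heart of MultComp-RCM is a one-line adjustment (a generic sketch, not the paper's implementation): with m simultaneous hypothesis tests, each p-value is compared against alpha/m rather than alpha.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests that remain significant after Bonferroni correction."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]
```

This controls the family-wise error rate at alpha; it is deliberately conservative, which is consistent with the paper's finding that correction also reduces the number of false discoveries.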


Pinaki Laskar on LinkedIn: #ai #datascience #programming

#artificialintelligence

What new technologies will be developed in the next decade? Causal Learning AI Machines (CLAIMs). They are highlighted by the Gartner Hype Cycle for Emerging Technologies 2022, which tracks technologies expected to significantly affect business, society, and people over the next 2 to 10 years. The top function of CLAIMs is to intelligently identify causal patterns across the universe of data classes, sets, and elements. CLAIMs could act as primary decision makers in government and public bodies, businesses, and other organizations: in economic and financial industries, to govern economies and finances, automate trading decisions, and detect prospective directions and investment opportunities; in legal sectors, to provide legal advice to individuals and small businesses.


House Price Prediction using a Random Forest Classifier

#artificialintelligence

In this blog post, I will use machine learning and Python to predict house prices. I will use a Random Forest Classifier (in fact, Random Forest regression, since price is a continuous target). In the end, I will demonstrate my Random Forest Python algorithm! There is no law except the law that there is no law. Data Science is about discovering hidden patterns (laws) in your data.
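A minimal version of that setup with scikit-learn, using synthetic data as a stand-in for the post's housing dataset (feature names and coefficients are invented for illustration). Because price is continuous, RandomForestRegressor is the appropriate estimator despite the "Classifier" in the title:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in features: living area (m^2) and a location score.
X = rng.uniform(50.0, 250.0, size=(500, 2))
y = 1500.0 * X[:, 0] + 800.0 * X[:, 1] + rng.normal(0.0, 5000.0, 500)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:400], y[:400])           # train on the first 400 rows
r2 = model.score(X[400:], y[400:])    # R^2 on the held-out 100 rows
```

On real data you would replace the synthetic arrays with the post's dataset and add a proper train/test split and feature preprocessing.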


Automated Supervised Feature Selection for Differentiated Patterns of Care

Wanjiru, Catherine, Ogallo, William, Tadesse, Girmaw Abebe, Wachira, Charles, Mulang', Isaiah Onando, Walcott-Bryant, Aisha

arXiv.org Artificial Intelligence

An automated feature selection pipeline was developed using several state-of-the-art feature selection techniques to select optimal features for Differentiated Patterns of Care (DPOC). The pipeline included three types of feature selection techniques: Filters, Wrappers, and Embedded methods, used to select the top K features. Five different datasets with binary dependent variables were used, and their respective top K optimal features were selected. The selected features were tested in the existing multi-dimensional subset scanning (MDSS) pipeline, where the most anomalous subpopulations, most anomalous subsets, propensity scores, and effect measures were recorded to assess their performance. This performance was compared with the same four metrics obtained when using all covariates in the dataset in the MDSS pipeline. We found that, regardless of the feature selection technique used, the data distribution is key when determining which technique to apply.
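A minimal Filter-style selector of the kind the pipeline compares (illustrative only; the paper evaluates several such techniques across the three families): rank features by absolute Pearson correlation with the binary target and keep the top K.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Filter method: indices of the k features most correlated with y."""
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    return np.argsort(scores)[::-1][:k]
```

Filters like this are fast and model-agnostic; Wrappers and Embedded methods trade that speed for model-aware selection, which is why the paper's choice among them hinges on the data distribution.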


The Upsurge of Artificial Intelligence Scientist

#artificialintelligence

The upsurge of Artificial Intelligence (AI) is already underway. Perhaps not in the way you may have been led to think. Though AI has been a recurring topic since the 1950s, the field has only now gained real traction, thanks to advances in technology and algorithms. Most companies are excited to join the AI trend. With modern AI, deep learning techniques, and natural language processing (NLP), organizations are ready to embark on the AI journey.